Evaluation of text clustering methods using WordNet

Authors

  • Abdelmalek Amine
  • Zakaria Elberrichi
  • Michel Simonet
Abstract

The increasing number of digitized texts now available, notably on the Web, has created an acute need for text mining techniques. Clustering systems are used more and more often in text mining, especially to analyze texts and to extract the knowledge they contain. Given the vast number of clustering algorithms and techniques available, it is highly confusing for a user to choose the algorithm that best suits the target dataset. In fact, it is very hard to say which algorithms work best, since results depend considerably on the application and on the kind of data at hand. In this paper, we propose, study and compare three text clustering methods: an ascending hierarchical clustering method, a SOM-based clustering method and an ant-based clustering method, all of which use the synsets of WordNet as terms for representing textual documents. The effects of these methods are examined in several experiments using three similarity measures: the cosine distance, the Euclidean distance and the Manhattan distance. The Reuters-21578 corpus is used for evaluation, and the evaluation is carried out using the F-measure. The results show that the SOM-based clustering method using the cosine distance provides the best results.
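As an illustration of the three measures and the evaluation criterion named in the abstract, here is a minimal Python sketch. It assumes documents are already represented as equal-length numeric vectors (for example, synset frequency counts). The function names are hypothetical, and the clustering F-measure shown is one common formulation (best F-score per class, weighted by class size); the abstract does not state the exact variant the authors used.

```python
import math
from collections import Counter, defaultdict


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def clustering_f_measure(true_labels, cluster_ids):
    """Class-size-weighted average of the best F-score each class reaches
    in any cluster (an assumed, commonly used definition)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))

    best = defaultdict(float)
    for (cls, clu), n_ij in overlap.items():
        precision = n_ij / cluster_sizes[clu]
        recall = n_ij / class_sizes[cls]
        f_score = 2 * precision * recall / (precision + recall)
        best[cls] = max(best[cls], f_score)

    return sum(class_sizes[c] / n * best[c] for c in class_sizes)
```

With this sketch, a clustering that matches the reference classes exactly scores 1.0, while putting every document in a single cluster lowers precision and therefore the score.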

Similar resources

WordNet Based Multi-Way Concept Hierarchy Construction from Text Corpus

In this paper, we propose an approach to build a multi-way concept hierarchy from a text corpus, which is based on WordNet and multi-way hierarchical clustering. In addition, a new evaluation metric is presented, and our approach is compared with 4 kinds of existing methods on the Amazon Customer Review data set.

A Comparative Analysis of Particle Swarm Optimization and K-means Algorithm For Text Clustering Using Nepali Wordnet

The volume of digitized text documents on the web has been increasing rapidly. As there is a huge collection of data on the web, there is a need for grouping (clustering) the documents into clusters for speedy information retrieval. Clustering of documents is the collection of documents into groups such that the documents within each group are similar to each other and not to documents of other groups...

Wordnet improves Text Document Clustering

Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. The bag of words representation used for these clustering methods is often unsatisfactory as it ignores relationships between important terms that do not co-occur literally. In order to deal with the pro...

Automatic Construction of Persian ICT WordNet using Princeton WordNet

WordNet is a large lexical database of English language, in which, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...

Enhancing Text Document Clustering Using Non-negative Matrix Factorization and WordNet

A classic document clustering technique may incorrectly classify documents into different clusters when documents that should belong to the same cluster do not have any shared terms. Recently, to overcome this problem, internal and external knowledge-based approaches have been used for text document clustering. However, the clustering results of these approaches are influenced by the inherent s...

Journal:
  • Int. Arab J. Inf. Technol.

Volume 7  Issue 

Pages  -

Publication date 2010